Genome Research
Cold Spring Harbor Laboratory
Preprints posted in the last 30 days, ranked by how well they match Genome Research's content profile, based on 409 papers previously published here. The average preprint has a 0.15% match score for this journal, so anything above that is already an above-average fit.
Epain, V.; Mane, A.; Della Vedova, G.; Bonizzoni, P.; Chauve, C.
We address the problem of plasmid binning, which aims to group contigs from a draft short-read assembly of a bacterial sample into bins, each expected to correspond to a plasmid present in the sequenced bacterial genome. We formulate the plasmid binning problem as a network multi-flow problem in the assembly graph and describe a Mixed-Integer Linear Program to solve it. We compare our new method, PlasBin-HMF, with the state-of-the-art methods MOB-recon, gplasCC, and PlasBin-flow on a dataset of more than 500 bacterial samples, and show that PlasBin-HMF outperforms the other methods while preserving explainability.
Durbin, R.
Skiplists (Pugh, 1990) are probabilistic data structures over ordered lists supporting O(log N) insertion and search, which share many properties with balanced binary trees. Previously we introduced the graph Burrows-Wheeler transform (GBWT) to support efficient search over pangenome path sets, but current implementations are static and cumbersome to build and use. Here we introduce two doubly-linked skiplist variants over run-length-compressed BWTs that support O(log N) rank, access and insert operations. We use these to store and search over paths through a syncmer graph built from Edgar's closed syncmers, equivalent to a sparse de Bruijn graph. Code is available in rskip.[ch] within the syng package at github.com/richarddurbin/syng. This builds a 5.8 GB lossless GBWT representation of 92 full human genomes (280 Gbp, including all centromeres and other repeats) single-threaded in 52 minutes, on top of a 4 GB 63 bp syncmer set built in 37 minutes. Arbitrarily long maximal exact matches (MEMs) can then be found as seeds for sequence matches to the graph at a search rate of approximately 1 Gbp per 10 seconds per thread.
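For readers unfamiliar with the underlying data structure, a minimal Pugh-style skiplist can be sketched in a few lines of Python. This is a toy illustration of the expected O(log N) search and insert only; the preprint's variants operate over run-length-compressed BWTs and are implemented in C.

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    """Toy probabilistic skiplist: expected O(log N) search and insert."""
    MAX_HEIGHT = 16
    P = 0.5  # probability of promoting a node to the next level

    def __init__(self):
        self.head = SkipNode(None, self.MAX_HEIGHT)

    def _random_height(self):
        h = 1
        while h < self.MAX_HEIGHT and random.random() < self.P:
            h += 1
        return h

    def insert(self, key):
        # Record the rightmost node before `key` at every level.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        new = SkipNode(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new

    def search(self, key):
        node = self.head
        for level in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key
```

Search correctness does not depend on the random level assignments, only the expected running time does.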
Plachy, J.; Sladky, O.; Brinda, K.; Vesely, P.
The growing interest in k-mer-based methods across bioinformatics calls for compact k-mer set representations that can be optimized for specific downstream applications. Recently, masked superstrings have provided such flexibility by moving beyond de Bruijn graph paths to general k-mer superstrings equipped with a binary mask, thereby subsuming Spectrum-Preserving String Sets and achieving compactness on arbitrary k-mer sets. However, existing methods optimize superstring length and mask properties in two separate steps, possibly missing solutions where a small increase in superstring length yields a substantial reduction in mask complexity. Here, we introduce the first method for Pareto optimization of k-mer superstrings and masks, and apply it to the problem of compressing pan-genome k-mer sets. We model the compressibility of masked superstrings using an objective that combines superstring length and the number of runs in the mask. We prove that the resulting optimization problem is NP-hard and develop a heuristic based on iterative deepening search in the Aho-Corasick automaton. Using microbial pan-genome datasets, we characterize the Pareto front in the superstring-length/mask-run space and show that the front contains points that Pareto-dominate simplitigs and matchtigs, while nearly encompassing the previously studied greedy masked superstrings. Finally, we demonstrate that Pareto-optimized masked superstrings improve pan-genome k-mer set compressibility by 12-19% when combined with neural-network compressors.
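The masked-superstring representation itself is simple to illustrate. A toy decoder follows, assuming a per-position binary mask in which a set bit marks the start of a k-mer belonging to the set; the optimization over superstring length and mask runs described above is the hard part and is not shown.

```python
def kmers_from_masked_superstring(superstring, mask, k):
    """Decode the represented k-mer set: positions with mask bit '1'
    mark the starts of k-mers belonging to the set."""
    assert len(superstring) == len(mask)
    return {superstring[i:i + k]
            for i, bit in enumerate(mask) if bit == "1"}

def mask_runs(mask):
    """Number of runs of equal bits in the mask, the quantity the
    Pareto objective trades against superstring length."""
    return 1 + sum(1 for a, b in zip(mask, mask[1:]) if a != b)

# With k = 3, the superstring "ACGTA" under mask "11100" encodes
# {ACG, CGT, GTA}; flipping one bit to "10100" drops CGT from the
# set without changing the superstring, at the cost of more runs.
```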
Weir, J. A.; Krebs, Y.; Chen, F.
Probe-based single cell RNA sequencing approaches are increasingly becoming a technology of choice for profiling gene expression at scale and in archival tissues. The 10x Genomics Flex v1 assay enables cost-effective and high-sensitivity single-cell RNA sequencing by splitting samples across up to 16 uniquely barcoded probe sets before pooling and loading onto a single lane of a microfluidic chip. A natural extension of this design is to leverage probe set barcoding as a sample barcoding strategy for case-control experiments. However, we observed that Flex v1 probe set barcode identity drives substantial technical variation between probe set barcodes, an effect that is reproducible across lanes and independent datasets. When Flex v1 probe set barcodes are confounded with biological sample identity, a concerning number of differentially expressed genes at standard thresholds are false positives. The Flex v2 assay, which decouples sample barcoding from probe set hybridization, significantly reduces this artifact. As the field continues to expand adoption of probe-based assays, our findings introduce probe set barcoding as an underappreciated source of technical variation in single-cell assays and emphasize the importance of experimental design when using probe-based sequencing technologies.
Ghoreishi, S. A.; Szmigiel, A. W.; Nagai, J. S.; Gesteira Costa Filho, I.; Zimek, A.; Campello, R. J. G. B.
Single-cell RNA sequencing (scRNA-seq) is widely used to resolve cellular heterogeneity across thousands to millions of cells. A major challenge is to identify biologically meaningful cell populations while preserving their hierarchical organization, because broad cell types frequently split into more specialized subtypes. However, state-of-the-art approaches mostly focus on flat partitions and ignore the hierarchical structure of single-cell data. Here we introduce GraphHDBSCAN*, a graph-based, hyperparameter-free extension of HDBSCAN* that performs hierarchical density-based clustering on a graph representation of the data, enabling robust recovery of both single-level and hierarchical relationships in high-dimensional and sparse datasets. We evaluate GraphHDBSCAN* across multiple scRNA-seq datasets and show that it recovers biologically meaningful hierarchies that reveal fine-grained structure in complex data, including monocyte subpopulations. In addition, the method yields high-quality flat partitions that outperform widely used community-detection methods.
Yang, Y.
The rapid growth of T-cell receptor (TCR) sequencing data has created an urgent need for computational methods that can efficiently search CDR3 sequences at scale. Existing approaches either rely on exact pairwise distance computation, which scales quadratically with repertoire size, or employ heuristic grouping that sacrifices sensitivity. Here we present TCRseek, a two-stage retrieval framework that combines biologically informed sequence embeddings with approximate nearest neighbor (ANN) indexing for scalable search over TCR repertoires. TCRseek first encodes CDR3 amino acid sequences into fixed-length numerical vectors through a multi-scale windowed k-mer embedding scheme derived from BLOSUM62 eigendecomposition, then indexes these vectors using FAISS-based structures (IVF-Flat, IVF-PQ, or HNSW-Flat) that support sublinear-time search. A second-stage reranking module refines the shortlisted candidates using exact sequence alignment scores (Needleman-Wunsch with BLOSUM62), Levenshtein distance, or Hamming distance. We benchmarked TCRseek against tcrdist3, TCRMatch, and GIANA on a 100,000-sequence corpus with precomputed exact ground truth under three distance metrics. Under cross-metric evaluation--where the reranking and ground truth metrics differ, providing the most informative test of generalization--TCRseek achieved NDCG@10 = 0.890 (Levenshtein ground truth) and 0.880 (Hamming ground truth), ranking highest among the retained baselines under Hamming and remaining competitive with tcrdist3 (0.894) under Levenshtein. When the reranking metric matches the ground truth definition (BLOSUM62 alignment), NDCG@10 reached 0.993, confirming that the ANN shortlist captures >99% of true neighbors--the expected ceiling of the two-stage design. On the 100,000-sequence corpus, TCRseek achieved 3.6-39.6x speedup over exact brute-force search depending on index type and distance metric, with the largest gains for alignment-based retrieval. 
These results demonstrate that embedding-based ANN search provides a practical and scalable alternative for TCR repertoire analysis.
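The two-stage retrieval design can be sketched with a toy stand-in: an amino-acid-composition embedding rather than TCRseek's BLOSUM62-derived windowed k-mer scheme, and an exhaustive embedding scan rather than a FAISS index. The names and parameters here are illustrative assumptions, not the published API.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def embed(seq):
    """Toy fixed-length embedding: amino-acid composition fractions."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO]

def levenshtein(a, b):
    """Standard dynamic-programming edit distance, used for reranking."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def search(query, corpus, shortlist_size=3, top=1):
    q = embed(query)
    # Stage 1: coarse shortlist by squared Euclidean distance in embedding space.
    coarse = sorted(corpus,
                    key=lambda s: sum((x - y) ** 2 for x, y in zip(embed(s), q)))
    shortlist = coarse[:shortlist_size]
    # Stage 2: exact rerank of the shortlist by edit distance.
    return sorted(shortlist, key=lambda s: levenshtein(s, query))[:top]
```

The two-stage structure is what makes sublinear ANN indexing pay off: the cheap embedding pass prunes the corpus, and the expensive exact metric only touches the shortlist.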
Hartman, A.; Blair, J. D.; Nguyen, T. P.; Dyson, K.; Bradu, A.; Takacsi-Nagy, O.; Santostefano, K.; Boade, T.; Bolanos, M.; Zhu, R.; Dann, E.; Marson, A.; Gitler, A.; Satija, R.; Satpathy, A. T.; Roth, T. L.
Genome-wide Perturb-seq (GWPS) has emerged as a powerful approach for unbiased mapping of gene regulatory networks. A key assumption underlying many Perturb-seq analyses is that each guide RNA exclusively perturbs a single target locus. Without methods to identify and filter off-target events, erroneous gene-pathway associations driven by off-target activity can propagate into downstream analyses. Here, we present a workflow for the systematic identification of candidate off-target events in CRISPRi Perturb-seq experiments. Our approach exploits the observation that cells harboring a guide that represses an off-target gene display transcriptional similarity to cells in which that gene is directly targeted by an on-target guide. We apply our workflow to multiple GWPS datasets and nominate off-target events in which a guide nominally targeting one gene also represses a distinct gene, producing a phenotype likely attributable to the off-target perturbation. We use both off-target gene repression and guide seed sequence alignments at the off-target promoter locus as evidence for off-target effects and find independent evidence of putative off-target events in separate GWPS datasets. Together, these results establish a principled framework for the identification and filtering of off-target guide effects in Perturb-seq experiments.
Conchon-Kerjan, E.; Rouze, T.; Robidou, L.; Ingels, F.; Limasset, A.
Approximate membership query structures are used throughout sequence bioinformatics, from read screening and metagenomic classification to assembly, indexing, and error correction. Among them, Bloom filters remain the default choice. They are not the most efficient structures in either time or memory, but they provide an effective compromise between compactness, speed, simplicity, and dynamic insertions, which explains their widespread adoption in practice. Their main drawback is poor cache locality, since each query typically requires several random memory accesses. Blocked Bloom filters alleviate this issue by restricting accesses for any given element to a single memory block, but this usually comes with a loss in accuracy at fixed memory. In this work, we introduce the Super Bloom Filter, a Bloom filter variant designed for streaming k-mer queries on biological sequences. Super Bloom uses minimizers to group adjacent k-mers into super-k-mers and assigns all k-mers of a group to the same memory block, thereby amortizing random accesses over consecutive k-mer queries and improving cache efficiency. We further combine this layout with the findere scheme, which reduces false positives by requiring consistent evidence across overlapping subwords. We provide a theoretical analysis of the construction of Super Bloom filters, showing how minimizer density controls the expected reduction in memory transfers, and derive a practical parameterization strategy linking memory budget, block size, collision overhead, and the number of hash functions to robust false-positive control. Across a broad range of memory budgets and numbers of hash functions, Super Bloom consistently outperforms existing Bloom filter implementations, with several-fold time improvements. 
As a practical validation, we integrated it into a Rust reimplementation of BioBloom Tools, a sequence screening tool that builds filters from reference genomes and classifies reads through k-mer membership queries for applications such as host removal and contamination filtering. This replacement yields substantially faster indexing and querying than both the original C++ implementation and Rust variants based on Bloom filters and blocked Bloom filters. The findere scheme also reduces false positives by several orders of magnitude, with some configurations yielding no observed false positives among 10^9 randomly queried k-mers. Code is available at https://github.com/EtienneC-K/SuperBloom and https://github.com/Malfoy/SBB.
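The core layout idea, choosing a Bloom filter block by hashing the k-mer's minimizer so that consecutive k-mers sharing a minimizer touch the same cache block, can be sketched as follows. This is a toy Python illustration using lexicographic minimizers and integer bitmasks as blocks; the actual Super Bloom implementation is in Rust and additionally layers on the findere scheme.

```python
import hashlib

def h(data, seed):
    """Seeded 64-bit hash of a string via blake2b's salt parameter."""
    digest = hashlib.blake2b(data.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")

def minimizer(kmer, m):
    """Lexicographically smallest m-mer of the k-mer (toy minimizer)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

class MinimizerBlockedBloom:
    """Toy blocked Bloom filter: the block is chosen by the k-mer's
    minimizer, so consecutive k-mers that share a minimizer hit the
    same block, amortizing random memory accesses."""
    def __init__(self, n_blocks=1024, block_bits=512, n_hashes=3, m=5):
        self.blocks = [0] * n_blocks           # each block is one bitmask
        self.n_blocks = n_blocks
        self.block_bits, self.n_hashes, self.m = block_bits, n_hashes, m

    def _positions(self, kmer):
        block = h(minimizer(kmer, self.m), 0) % self.n_blocks
        bits = [h(kmer, s + 1) % self.block_bits for s in range(self.n_hashes)]
        return block, bits

    def insert(self, kmer):
        block, bits = self._positions(kmer)
        for b in bits:
            self.blocks[block] |= 1 << b

    def query(self, kmer):
        block, bits = self._positions(kmer)
        return all(self.blocks[block] >> b & 1 for b in bits)
```

As with any Bloom filter, queries can return false positives but never false negatives; restricting each element to one block trades a little accuracy for locality.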
Poggiali, B.; Putzeys, L.; Andersen, J. D.; Vidaki, A.
Summary: The human genome is dominated by repetitive DNA, whose genetic and epigenetic variation plays a key role in gene regulation, genome stability, and disease. Recent advances in long-read sequencing now enable large-scale, haplotype-resolved, and DNA methylation-informative analysis of the human genome, including previously inaccessible complex and repetitive regions. However, the comprehensive, simultaneous characterisation of the "human repeatome" remains challenging, largely due to the lack of tools integrated in a single pipeline that can capture the full spectrum of variation across diverse types of DNA repeats. Here, we present ECHO, a user-friendly, Snakemake-based pipeline for the "(Epi)genomic Characterisation of Human Repetitive Elements using Oxford Nanopore Sequencing". ECHO provides a reproducible and scalable framework for end-to-end analysis of whole-genome nanopore sequencing data, enabling integrative but also tailored (epi)genetic analyses of the human repeatome. Availability and implementation: ECHO is freely available on GitHub: https://github.com/leenput/ECHO-pipeline, with an archived version at Zenodo: https://zenodo.org/records/19068468. Contact: athina.vidaki@mumc.nl; athina.vidaki@maastrichtuniversity.nl
BAI, J.; Yang, R.
By mapping ribosome-protected fragments (RPFs) genome-wide, ribosome profiling (Ribo-seq) has uncovered extensive translation beyond conventional coding sequences, revealing non-canonical ORFs (ncORFs) with emerging roles in diverse biological processes. However, protocol-induced biases introduced during library construction can substantially distort RPF signals. Most existing ORF callers are not designed to explicitly account for such artifacts, limiting robust ncORF identification. Here, we present RiboBA, a bias-aware probabilistic framework to address this challenge. RiboBA consists of two main components: a generative module that recovers protocol-induced biases and codon-level ribosome occupancy, and a supervised module that identifies translated ORFs and initiation sites using the resulting bias-adjusted profiles. Evaluated through simulations and on a range of Ribo-seq datasets--particularly supported by cell-type-specific immunopeptidomics--RiboBA robustly recovers protocol-induced parameters and achieves superior accuracy and sensitivity in ncORF identification. Notably, RiboBA performs particularly well on RNase I libraries with attenuated three-nucleotide periodicity, as well as on MNase and nuclease P1 libraries, while maintaining competitive runtimes. In a Drosophila case study, RiboBA identifies conserved ncORFs with coding potential, including recurrent upstream translation of ThrRS and Mettl2 that suggests a potential threonine-specific translational control axis.
Dip, S. A.; Zhang, L.
Predicting transcriptional responses to genetic perturbations is a central challenge in functional genomics. CRISPR Perturb-seq experiments measure gene expression changes induced by targeted perturbations, yet experimentally testing all possible perturbations remains infeasible. Computational models that infer responses for unseen perturbations are therefore essential for scalable functional discovery. We introduce PerturbGraph, a biologically informed graph-learning framework for predicting transcriptional responses of unseen gene perturbations by integrating interaction networks, functional annotations, and transcriptional features. Our approach is motivated by the observation that perturbation effects propagate through molecular interaction networks and manifest as coordinated transcriptional programs. Starting from single-cell CRISPR perturbation data, we construct perturbation signatures representing expression shifts relative to control cells and project them into a compact latent program space that captures stable transcriptional variation while reducing noise. Each gene is represented using enriched biological features integrating protein-protein interaction network embeddings, network topology statistics, baseline transcriptional characteristics, and Gene Ontology annotations. A graph neural network propagates information across the interaction network to infer perturbation programs for genes whose effects are not observed during training. Across unseen-perturbation benchmarks, PerturbGraph consistently outperforms classical machine learning models, perturbation-specific deep learning approaches such as scGen and CPA, and alternative graph neural architectures. The model achieves up to 6% improvement in cosine similarity over strong tree-based baselines and more than 20% improvement over linear models while improving recovery of differentially expressed genes. 
These results show that integrating biological interaction networks with graph representation learning enables accurate prediction of transcriptional effects for previously unobserved genetic perturbations. Code is publicly available at https://github.com/Sajib-006/PerturbGraph.
Hendrychova, V.; Brinda, K.
One important question in bacterial genomics is how to represent and search modern million-genome collections at scale. Phylogenetic compression effectively addresses this by guiding compression and search via evolutionary history, and many related methods similarly rely on tree- and ordering-based heuristics that leverage the same underlying phylogenetic signal. Yet, the mathematical principles underlying phylogenetic compression remain poorly understood. Here, we introduce the first formal framework to model phylogenetic compression mechanisms. We study genome collections represented as RLE-compressed SNP, k-mer, unitig, and uniq-row matrices and formulate compression as an optimization problem over genome orderings. We prove that while the problem is NP-hard for arbitrary data, for genomes following the Infinite Sites Model it becomes optimally solvable in polynomial time via Neighbor Joining (NJ). Finally, we experimentally validate the model's predictions on real bacterial datasets using an exact Traveling Salesperson Problem (TSP) solver. We demonstrate that, despite numerous simplifying assumptions, NJ orderings achieve near-optimal compression across dataset types, representations, and k-mer ranges. Altogether, these results explain the mathematical principles underlying the efficacy of phylogenetic compression and, more generally, the success of tree-based compression and indexing heuristics across bacterial genomics.
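The link between genome ordering and RLE size is easy to demonstrate on a toy SNP presence/absence matrix. The sketch below illustrates the optimization objective only (total runs across columns as a proxy for RLE-compressed size), not the NP-hardness proof or the NJ algorithm; the genomes and orderings are invented for illustration.

```python
def column_runs(matrix):
    """Total number of runs across all columns of a binary matrix,
    a proxy for its RLE-compressed size."""
    total = 0
    for col in range(len(matrix[0])):
        runs = 1
        for row in range(1, len(matrix)):
            if matrix[row][col] != matrix[row - 1][col]:
                runs += 1
        total += runs
    return total

# Four toy genomes as binary SNP rows, forming two clades (A and B).
genomes = {
    "A1": "111000", "A2": "111001",
    "B1": "000110", "B2": "000111",
}

def runs_for(order):
    return column_runs([genomes[g] for g in order])

tree_order = ["A1", "A2", "B1", "B2"]  # clade-grouped (NJ-like) ordering
shuffled   = ["A1", "B1", "A2", "B2"]  # clades interleaved
```

Grouping relatives adjacently makes each SNP column change value fewer times down the matrix, so the clade-grouped ordering compresses better, which is exactly the signal phylogenetic compression exploits.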
Torma, G.; Balazs, Z.; Fulop, A.; Tombacz, D.; Boldogkoi, Z.
Long-read RNA sequencing (lrRNA-seq) enables direct reconstruction of full-length transcripts, yet existing annotation tools show variable performance across genomes and library chemistries, particularly for novel isoforms. We present LoRTIA Plus, a chemistry-agnostic suite for transcriptome annotation and reconstruction from lrRNA-seq data. LoRTIA Plus first detects and filters transcription start sites (TSSs), transcription end sites (TESs), and introns using adapter-aware and quality-based criteria, and evaluates read support before assembling high-confidence transcript models. We benchmarked LoRTIA Plus against bambu, FLAIR, IsoQuant, and NAGATA on KSHV transcriptomes with dense overlap, using a validated literature-supported boundary set, and on transcriptomes from three human cell lines from the Long Read RNA-seq Genome Annotation Assessment Project (LRGASP) sequenced with five long-read chemistries. On KSHV, LoRTIA Plus achieved the highest F1 scores for TSSs, TESs, and transcripts in both direct-cDNA and direct-RNA datasets by improving recall without sacrificing precision. Across human datasets, LoRTIA Plus consistently ranked among the top boundary annotators across all chemistries and was the best-performing tool in PCR-based libraries, while remaining highly competitive on native RNA. Junction- and isoform-level analyses show that LoRTIA Plus yields a rich, reproducible repertoire of novel isoforms and transcript boundaries from viral to human transcriptomes.
Hearne, G.; Refahi, M. S.; Polikar, R.; Rosen, G. L.
Transformer-based Genomic Language Models (GLMs) have achieved strong performance across diverse genomic prediction tasks. However, their tendency toward overconfident predictions--particularly on noisy or unfamiliar data--limits reliability. In genomics, where unknown species and novel variants are common, developing models robust to distribution shift is crucial for dependable predictions. Here, we analyze the impact of several common and novel uncertainty quantification (UQ) methods in the context of GLMs, evaluating their performance across diverse downstream genomic and metagenomic prediction tasks. Comparing model behavior on both in-distribution (ID) and out-of-distribution (OOD) data, we show that temperature scaling and epistemic neural networks are capable of improving classification reliability across multiple GLM architectures and domains. The software is available at: https://github.com/EESI/glm-epinet-pyt
Zhang, S.; Lu, Y.; Luo, Q.; An, L.
Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, the traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expressions on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementation: MiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.
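The Dirichlet-pseudo-bulk generation step can be sketched as a random convex combination of known cell-type mean expression profiles. This is a minimal stand-in assuming Dirichlet-sampled mixture weights over per-type means; it is not the published MiCBuS implementation (which is in R), and the function names are illustrative.

```python
import random

def dirichlet(alpha, rng):
    """Sample mixture weights from a Dirichlet via normalized Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def pseudo_bulk(cell_type_means, alpha, rng):
    """One pseudo-bulk sample: a Dirichlet-weighted convex combination
    of cell-type mean expression profiles."""
    weights = dirichlet(alpha, rng)
    n_genes = len(next(iter(cell_type_means.values())))
    sample = [0.0] * n_genes
    for w, profile in zip(weights, cell_type_means.values()):
        for g in range(n_genes):
            sample[g] += w * profile[g]
    return sample
```

Because the weights are a convex combination, any gene expressed identically across the known cell types keeps that value in every pseudo-bulk sample, so deviations in the real bulk data point toward unobserved cell types.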
Etzioni, Z.; Zhao, L.; Hertleif, P.; Schuster-Boeckler, B.
Cytosine methylation is a crucial epigenetic mark that impacts tissue-specific chromatin conformation and gene expression. For many years, bisulfite sequencing (BS-seq), which converts all non-methylated cytosines (C) to thymine (T), remained the only approach to measure cytosine methylation at base resolution. Recently, however, several new methods that convert only methylated cytosines to thymine (mC[->]T) have become widely available. Here we present rastair, an integrated software toolkit for simultaneous SNP detection and methylation calling from mC[->]T sequencing data such as those created with Watchmaker's TAPS+ and Illumina's 5-Base chemistries. Rastair combines machine-learning-based variant detection with genotype-aware methylation estimation. Using NA12878 benchmark datasets, we show that rastair outperforms existing methylation-aware SNP callers and achieves F1 scores exceeding 0.99 for datasets above 30x depth, matching the accuracy of state-of-the-art tools run on whole-genome sequencing data. At the same time, rastair is significantly faster than other genetic variant callers: processing a 30x depth file takes less than 30 minutes on 32 CPU cores of an Intel Xeon, and half as long when a GPU is available. By integrating genotyping with methylation calling, rastair reports an additional 500,000 positions in NA12878 where a SNP turns a non-CpG reference position into a "de-novo" CpG. Vice versa, rastair also identifies positions where a variant disrupts a CpG and corrects their reported methylation levels. Rastair produces standard-compliant outputs in VCF, BAM and BED formats, facilitating integration into downstream analysis pipelines. Rastair is open-source and available via conda, Dockerhub, and as pre-compiled binaries from https://www.rastair.com.
Khan, M. S. A.; Kabir, M. H.; Faisal, M. M.
Single-cell RNA sequencing (scRNA-seq) enables characterization of cellular heterogeneity, but clustering remains challenging due to high dimensionality, dropout-induced sparsity, and technical noise. Existing graph-based and contrastive methods often rely on predefined similarity measures or suffer from high computational costs on large datasets. We propose single-cell Transformer-based Graph Contrastive Learning (scTGCL), a framework integrating multi-head self-attention with graph contrastive learning to learn robust cell representations. The model projects raw expression data into an embedding space and employs multi-head attention to adaptively learn weighted cell-cell graphs capturing diverse biological relationships. For contrastive augmentation, we apply random gene masking at the feature level and random edge dropping on attention matrices, simulating dropout and structural uncertainty. A symmetric contrastive loss maximizes agreement between original and augmented representations, while joint optimization with reconstruction and imputation losses preserves biological interpretability. Experiments on ten real scRNA-seq datasets demonstrate that scTGCL consistently outperforms nine state-of-the-art methods across clustering accuracy, normalized mutual information, and adjusted Rand index. Ablation studies validate each architectural component, and robustness analysis on simulated data confirms stable performance under varying dropout rates and differential expression levels. Furthermore, scTGCL exhibits superior computational efficiency, achieving substantially lower runtime on large-scale datasets compared with existing approaches. The framework provides an accurate, efficient, and scalable solution for single-cell clustering. Source code and datasets are available at https://github.com/ShoaibAbdullahKhan/scTGCL.
Forcier, T.; Cheng, E.; Tam, O. H.; Wunderlich, C.; Castilla-Vallmanya, L.; Jones, J. L.; Quaegebeur, A.; Barker, R. A.; Jakobsson, J.; Gale Hammell, M.
Transposable elements (TEs) are mobile genetic sequences that can generate new copies of themselves via insertional mutations. These viral-like sequences comprise nearly half the human genome and are present in most genome-wide sequencing assays. While only a small fraction of genomic TEs have retained their ability to transpose, TE sequences are often transcribed from their own promoters or as part of larger gene transcripts. Accurately assessing TE expression from each individual genomic TE locus remains an open problem in the field, due to the highly repetitive nature of these multi-copy sequences. These issues are compounded in single-cell and single-nucleus transcriptome experiments, where additional complications arise due to sparse read coverage and unprocessed mRNA introns. Here we present our tool for single-cell TE and gene expression analysis, TEsingle. Using synthetic datasets, we show the problems that arise when not properly accounting for intron retention events, failing to address uncertainty in alignment scoring, and failing to make use of unique molecular identifiers for transcript resolution. Addressing these challenges has enabled an accurate TE analysis suite that simultaneously tracks gene expression as well as locus-specific resolution of expressed TEs. We showcase the performance of TEsingle using single-nucleus profiles from substantia nigra (SN) tissues of Parkinson's Disease (PD) patients. We find examples of young and intact TEs that mark dopaminergic (DA) neurons as well as many young TEs from the LINE and ERV families that are elevated in PD neurons and glia. These results demonstrate that TE expression is highly cell-type and cellular-state specific and elevated in particular subsets of neurons, astrocytes, and microglia from PD patients.
Xiang, Y.; Xiao, X.; Zhou, B.; Xie, L.
Motivation: Enhancer-derived RNAs (eRNAs) and their fusion with protein-coding genes represent a crucial yet understudied layer of transcriptional regulation. eRNAs are typically expressed at low levels, which makes fusion events difficult to detect with conventional fusion detection tools. In addition, these tools are not designed to capture fusion transcripts arising from spatial proximity between distal regulatory elements and gene loci. Reads spanning such regions are also frequently filtered as mapping artifacts. As a result, computational approaches for systematically identifying spatially mediated enhancer-exon fusion transcripts remain lacking. Methods: We developed ChiMER, a graph-based framework for detecting ChiMeric Enhancer RNAs from short-read RNA-seq data. ChiMER constructs splice graphs with chromatin contact information to introduce enhancer-exon edges and uses graph alignment to search for potential transcriptional paths. A ranking-based scoring module then prioritizes high-confidence events. Evaluations on simulated and real RNA-seq datasets show that ChiMER achieves higher sensitivity than conventional linear fusion detection methods while maintaining low false-positive rates. Results: Applied to cancer cell line RNA-seq datasets, ChiMER identified multiple enhancer-exon chimeric transcripts, several associated with super-enhancer regions. Multi-omics analysis further shows that fusion transcripts occur in transcriptionally active regulatory environments and frequently coincide with strong R-loop signals, suggesting a potential role of RNA-DNA hybrid structures in facilitating long-range transcriptional joining events. Availability: https://github.com/Candlelight-XYJ/ChiMER Contact: yujia.xiang@outlook.com, xielinhai@ncpsb.org.cn
Tanner, R. M.; Perkins, T. J.
Histone modifications are a key component of the epigenetic state of a cell, and they vary widely across different cell and tissue types, conditions, and disease states. Indeed, the majority of the genome is enriched with one histone mark or another across the thousands of cellular conditions that have been studied to date. Here, we use the largest-to-date collection of histone modification ChIP-seq datasets to identify the most important sites of histone modifications genome-wide. Collected and uniformly reprocessed by the International Human Epigenome Consortium, these data include 5339 datasets enriched at nearly one billion total peaks across 59 different major cell or tissue types and in healthy and disease conditions, for six different histone marks. We propose FindMetapeaks, a new approach to identifying histone mark metapeaks, which are genomic regions with enrichment of a mark across many samples. We show that many of these epigenetic metapeaks are strongly indicative of cell and tissue type, or are associated with other sample characteristics, and highlight key regulatory regions of the genome. However, we also show that many metapeaks contain redundant information, and that parsimonious subsets of metapeaks can be selected by machine learning to predict cell state. Our histone mark metapeak atlas provides a concise set of regions for interpreting the epigenome. Availability: https://github.com/rmbioinfo83/FindMetapeaks/
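The metapeak concept, genomic regions enriched across many samples, can be sketched as bin-level counting. This is a simplified stand-in for FindMetapeaks; the bin size, sample threshold, and function name are illustrative assumptions, not the published method.

```python
from collections import Counter

def metapeak_bins(sample_peaks, bin_size, min_samples):
    """Return genome bins covered by peaks in at least `min_samples`
    samples; a toy stand-in for metapeak calling."""
    coverage = Counter()
    for peaks in sample_peaks:               # one (start, end) list per sample
        bins = set()
        for start, end in peaks:
            bins.update(range(start // bin_size, (end - 1) // bin_size + 1))
        for b in bins:                       # each sample counts once per bin
            coverage[b] += 1
    return sorted(b for b, count in coverage.items() if count >= min_samples)
```

Deduplicating bins per sample before counting ensures a sample with many overlapping peaks in one region still contributes a single vote, which is what makes the count a cross-sample recurrence measure.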